In this post I'll show you how to simulate a large dataset of mortality claims. Sometimes it's useful to use simulated data. For example, if you wanted to experiment with new modelling techniques but were unable to find a useful dataset, it may be easiest just to create your own. Alternatively, you may simply wish to supplement existing data with simulated data. If you have some assumptions about the variables you want to create, it's possible to create entire datasets from scratch.
Example
Imagine that you wanted to experiment with fitting different predictive models to life insurance claims data (perhaps to practice your coding skills or to explore an idea) but didn't have all the data required. Rather than spend time searching for or collecting large volumes of data appropriate for your specific problem, you could simply (and quickly) create a simulated dataset.
I will show you how to create a dataset representing a portfolio of lives with lump sum life insurance cover, but the following techniques can be used to create all sorts of data. We will end up with a large dataset in which each row represents one life and the columns contain information about that person's age, salary, occupation, gender, location and, finally, whether or not that person died during a specific period.
In order to do this we will need to come up with some assumptions about the variables we wish to create. The assumptions I will make will be very simple and are listed below.
Age: Normally distributed with a mean of 40 and a variance of 25 (a standard deviation of 5 years)
Gender: 40% of the sample will be male, 60% female
Occupation: 3 categories representing different occupations. 50% of the lives will be in category 1, 20% in category 2 and 30% in category 3
Location: 4 categories representing different areas of residence. 10% will be in category 1, 20% in category 2, 50% in category 3 and 20% in category 4
Salary: Gamma distributed. I will assume that mean salary increases with age and varies by occupation
Exposure: This variable represents the length of time (in years) that each person was covered for. This will come from a uniform(0,3) distribution (i.e. the maximum cover period is 3 years)
Claim: I will assume that the probability of death is given by a formula linked to age, occupation and location. The base level of mortality will be $\frac{Age}{20000}$. Mortality is 20% higher for those in occupation category 1 and 15% higher for those living in locations 3 or 4
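To make the claim assumption concrete, here is a minimal base R sketch of the probability formula described above. The function name claim_prob is my own illustrative label and is not part of simstudy.

```r
# Hypothetical helper (not part of simstudy): claim probability
# implied by the assumptions listed above
claim_prob <- function(age, occupation, location) {
  base <- age / 20000                                  # base mortality
  occ_load <- ifelse(occupation == 1, 1.2, 1)          # +20% for occupation 1
  loc_load <- ifelse(location %in% c(3, 4), 1.15, 1)   # +15% for locations 3 and 4
  base * occ_load * loc_load
}

# A 40-year-old in occupation 1 living in location 3:
# 40 / 20000 * 1.2 * 1.15 = 0.00276
claim_prob(40, 1, 3)

# The same person in occupation 2 and location 1 gets the base rate, 0.002
claim_prob(40, 2, 1)
```

Because the function is vectorised through ifelse(), it can also be applied to whole columns of ages, occupations and locations at once.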
Creating the Data in R
Setting Assumptions
I'll assume you have RStudio installed and some experience with basic R programming. If you have not used R before, I recommend visiting https://www.datacamp.com and taking the free intro to R course. We will be using the simstudy package. If you have not yet installed this package, run install.packages('simstudy') to do so.
Simstudy first requires us to generate a table containing our assumptions. Each row in this table will contain information relating to one of our variables. We will need to specify a name for each variable, the distribution we wish to simulate from and a formula for the mean of each variable (or, for some distributions, other parameters such as category probabilities or a range) and, in some cases, a variance. To create the assumptions table, use the defData() function.
First, load the necessary libraries.
library(simstudy)
library(ggplot2)
# initialise our assumptions table and add the Age variable
def = defData(varname = 'Age', dist = 'normal', formula = 40, variance = 25)
# the assumptions table has now been created and can be viewed
def
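A variance of 25 corresponds to a standard deviation of 5 years, which gives a quick way to anticipate the range of simulated ages. The following base R sketch uses pnorm() to work out the share of lives expected between ages 30 and 50; it does not depend on simstudy, and the variable names are illustrative.

```r
age_mean <- 40
age_sd <- sqrt(25)  # variance of 25 -> standard deviation of 5

# Expected proportion of lives aged between 30 and 50 (mean +/- 2 sd)
p_30_50 <- pnorm(50, mean = age_mean, sd = age_sd) -
  pnorm(30, mean = age_mean, sd = age_sd)
round(p_30_50, 3)  # 0.954
```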
To add to this table, use the defData() function again, but this time specify the name of the table we've just created using the dtDefs argument. For example, add the gender, occupation, location and exposure variables as follows:
# add a binary variable to indicate gender - 40% of lives will be male
# and will be represented as 1s
def = defData(dtDefs = def,varname = "Gender", dist = "binary", formula = 0.4)
# add a categorical variable to define occupation type.
# There will be three classes.
def = defData(dtDefs = def,varname = "Occupation", dist = "categorical",formula = "0.5;0.2;0.3")
# add the location variable to the data table
def = defData(dtDefs = def,varname = "Location",dist = "categorical",formula = "0.1;0.2;0.5;0.2")
# add a variable to indicate 'exposure years'
# (in other words, the length of time an individual was insured)
def = defData(def, varname = 'Exposure', dist = "uniform", formula = "0;3")
# the assumptions table now looks like:
def
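For the 'categorical' distribution, the formula is a string of category probabilities separated by semicolons, and these should sum to 1. As a rough base R analogue (an illustration only, not a description of how simstudy works internally), sample() with a prob vector draws categories in the same proportions:

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

occupation_probs <- c(0.5, 0.2, 0.3)
stopifnot(isTRUE(all.equal(sum(occupation_probs), 1)))  # probabilities must sum to 1

# Draw 100,000 occupation categories labelled 1, 2 and 3
occupation <- sample(1:3, size = 100000, replace = TRUE, prob = occupation_probs)

# Observed proportions should be close to 0.5, 0.2 and 0.3
observed <- as.numeric(prop.table(table(occupation)))
round(observed, 2)
```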
Since salary and claim depend on other variables, the 'formula' argument for these variables will be a bit more complicated. I have used the paste() function to build the formula strings passed into defData().
# add a salary variable - the mean of which will increase
# with age and will vary by occupation class.
# Use a gamma distribution (for 'gamma', simstudy treats the 'variance'
# argument as a dispersion parameter rather than a raw variance)
salary_formula = paste("ifelse(Occupation == 1, Age * 800,",
                       "ifelse(Occupation == 2, Age * 1200, Age * 1350))")
def = defData(def, varname = "Salary", dist = "gamma",
              formula = salary_formula, variance = 0.2)
# Now, add the binary claim indicator
# (the probability of death varies as specified earlier)
claim_formula = paste("(Age * 1/20000)",
                      " * ifelse(Occupation == 1,1.2,1)",
                      " * ifelse(Location %in% c(3,4), 1.15,1)",
                      sep = "")
def = defData(def, varname = "Claim", dist = "binary",formula = claim_formula)
# the final assumptions table now looks like:
def
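One detail worth checking: as far as I'm aware, for the gamma distribution simstudy interprets the variance argument as a dispersion parameter, so that the variance of the variable is mean^2 × dispersion. Under that assumption, the base R sketch below converts a mean and dispersion into the shape and rate that rgamma() expects and checks the result by simulation. The variable names are illustrative.

```r
set.seed(2024)  # arbitrary seed

mean_salary <- 40 * 800  # a 40-year-old in occupation 1: 32,000
dispersion <- 0.2        # the value passed as 'variance' for Salary

# Assumed parameterisation: variance = mean^2 * dispersion
shape <- 1 / dispersion                 # 5
rate <- 1 / (mean_salary * dispersion)  # 1 / 6400

salary <- rgamma(200000, shape = shape, rate = rate)

mean(salary)  # close to 32,000
sd(salary)    # close to 32000 * sqrt(0.2), roughly 14,311
```

So a dispersion of 0.2 implies a standard deviation of about 45% of the mean salary, whatever the age and occupation.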
Generating the Data
Now, actually generating the data is easy. Do this using the genData() function. The first argument, n, is the number of records you want to simulate. Then just pass in the assumptions table. I will simulate 1.5m records.
# simulate 1.5m records
dt = genData(n = 1500000,def)
# view the first few rows
head(dt)
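Because genData() draws random numbers, each run produces a different dataset. If you want reproducible results, one option is to call set.seed() first. The sketch below uses a small one-variable assumptions table and an arbitrary seed; def_small, dt_a and dt_b are illustrative names.

```r
library(simstudy)

# A small, one-variable assumptions table for illustration
def_small <- defData(varname = "Age", dist = "normal", formula = 40, variance = 25)

set.seed(2024)  # arbitrary seed
dt_a <- genData(n = 1000, def_small)

set.seed(2024)  # resetting the seed should reproduce the same draws
dt_b <- genData(n = 1000, def_small)

# Compare the two simulated Age columns
identical(dt_a$Age, dt_b$Age)
```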
Exploring the Simulated Data
Now, we can summarise and visualise the dataset that has been created to check that everything is as expected. The summary() and table() functions are very useful and the ggplot2 library is particularly good for creating visualisations. Use the ggplot() function to create density plots, bar charts and scatter plots to check that distributions have been created according to our specified assumptions and to check the relationships between variables. If you have not installed ggplot2, execute install.packages('ggplot2').
First, use summary() to see statistics for the entire dataset. Then use table() and prop.table() to see how claims vary by occupation (the same approach works for location).
# Summary stats for every column
summary(dt)
# Claim rate by occupation: table() gives the counts and prop.table()
# with margin = 1 converts each row (occupation) to proportions
print("Claim rate by occupation")
round(prop.table(table(dt$Occupation, dt$Claim), margin = 1), 5)
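It also helps to know roughly what claim rate to expect. Since age, occupation and location are simulated independently, the expected overall claim probability is the mean base rate multiplied by the average occupation and location loadings. The base R sketch below works this out from the assumptions; the variable names are illustrative.

```r
occupation_probs <- c(0.5, 0.2, 0.3)
location_probs <- c(0.1, 0.2, 0.5, 0.2)

occupation_load <- c(1.2, 1, 1)       # +20% for occupation 1
location_load <- c(1, 1, 1.15, 1.15)  # +15% for locations 3 and 4

base_rate <- 40 / 20000  # mean age of 40 divided by 20000

expected_rate <- base_rate *
  sum(occupation_probs * occupation_load) *  # average loading of 1.10
  sum(location_probs * location_load)        # average loading of 1.105

expected_rate            # 0.002431
expected_rate * 1500000  # roughly 3,650 expected claims in 1.5m records
```

The observed rate in the simulated data, mean(dt$Claim), should land close to this figure.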
Next, we can use density plots to check that ages are normally distributed and that salaries are a) gamma distributed and b) vary by occupation.
# view distribution of ages
ggplot(dt, aes(x=Age)) + geom_density() + theme_bw()
It looks like age has been simulated as expected. The curve is bell shaped and centred on 40, and most people are aged between 30 and 50, which is also in line with expectations given the variance we specified (a standard deviation of 5 years). To check that salary has been correctly simulated, repeat the ggplot() call but replace Age with Salary inside aes() (the aesthetic mapping). We will also add an extra layer, facet_wrap(), to generate a separate density plot for each occupation group.
# view distribution of salary by occupation class
ggplot(dt, aes(x = Salary)) +
  geom_density() +
  facet_wrap(~Occupation) +
  theme_bw() +
  ggtitle("Salary Dist. by Occupation Category")
These look like gamma distributions and salary seems to be highest for occupation category 3 - which is correct!
In our assumptions table we stated that salary should increase with age. We can check this using a scatter plot. Use ggplot() again but with geom_point() instead of geom_density(). Also, add y = Salary to the aesthetic.
# plot sample of age against salary
# (plot first 1000 rows for a better visual. 'pch = 21' controls the type of point)
ggplot(dt[1:1000,], aes(x = Age, y = Salary)) + geom_point(pch = 21) + theme_bw() +
  ggtitle("Age/Salary Scatter")
Finally, view a bar chart showing counts of each location category. This time use geom_bar() and add facet_wrap() to view separate bar charts for each occupation class.
# distribution of locations by occupation
ggplot(dt, aes(x = factor(Location))) + geom_bar() + theme_bw() +
  facet_wrap(~Occupation) +
  xlab("Location") +
  ggtitle("Location by Occupation Category")
Creating New Features
This last section will demonstrate how we could add some simple extra features to the dataset. These new features will be created using existing variables. Currently, age is exact and so we have lots of unique values. To reduce the number of distinct values I will create an 'Age_Last' variable (age at last birthday). In addition, it is sometimes useful to transform numerical variables into categorical ones, so I will show you how to create a 'Salary_Band' feature from the numerical salary variable. This will group salary into a number of different categories which we will treat as nominal (you may also wish to treat these as ordinal factors).
# create an 'Age_Last' variable from Age exact that we simulated earlier
dt$Age_Last = floor(dt$Age)
# we can also create a categorical variable from the raw numeric
# salary variable - e.g group into bins
# We will create groups using a helper function and custom defined breaks
# Bins can include, for example, those earning <25k, 25-35k, 35-45k, 45-75k,75k+
# (I've just arbitrarily picked these)
quant_bin = function(x, breaks){
  bin = as.numeric(cut(x,
                       breaks = breaks,
                       include.lowest = TRUE, right = TRUE))
  bin
}
breaks = c(0,25000,35000,45000,75000, max(dt$Salary))
dt$Salary_Band = quant_bin(dt$Salary, breaks)
# Now, view the first few rows of the data again
head(dt)
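If you would rather keep readable band labels, or treat the bands as ordinal, cut() can return an ordered factor directly. This is a sketch using the same arbitrary breaks on a few made-up salaries; the labels and the use of Inf as the top break are my own choices.

```r
# A handful of made-up salaries for illustration
salary <- c(18000, 25000, 30000, 50000, 120000)

breaks <- c(0, 25000, 35000, 45000, 75000, Inf)
band_labels <- c("0-25k", "25-35k", "35-45k", "45-75k", "75k+")

salary_band <- cut(salary,
                   breaks = breaks,
                   labels = band_labels,
                   include.lowest = TRUE,
                   right = TRUE,           # intervals are closed on the right
                   ordered_result = TRUE)  # gives an ordered factor

salary_band
as.integer(salary_band)  # 1 1 2 4 5
```

Note that with right = TRUE a salary of exactly 25,000 falls into the first band. Using Inf as the final break also avoids relying on the largest simulated salary being above 75,000.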
Summary
In summary, you should now be able to:
- Generate a table of assumptions
- Simulate a large dataset using the assumptions table
- Check summary statistics and visualise the data
- Create some simple new features